fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex#22273
fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex#22273Dandandan wants to merge 1 commit into
Conversation
`SIMILAR TO` was lowered directly to a regex match without translating SQL wildcards or anchoring the pattern, so `'abc' SIMILAR TO 'a%'` returned false instead of true and `.`/`^`/`$` were treated as regex metacharacters instead of literals. The planner now translates literal `SIMILAR TO` patterns into an equivalent POSIX regex (anchored with `^...$`, `%`→`.*`, `_`→`.`, literal `.`/`^`/`$` escaped, backslash escape and bracket expressions preserved) before lowering to the existing regex-match operator. NULL patterns flow through as a typed Utf8 null. Non-literal patterns now return a clear `not_impl` error rather than a silently wrong result, and the SQL-layer pattern-type check is widened to accept `LargeUtf8` and `Utf8View` literals. Closes apache#22263. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
@Dandandan
I think the direction makes sense, but I found two blocking issues that need to be addressed before this is ready.
| // Backslash escapes the next character in SIMILAR TO. Emit it | ||
| // as a literal in the regex by re-escaping it. | ||
| match chars.next() { | ||
| Some(next) => { |
There was a problem hiding this comment.
I think this is still treating escaped characters as regex escapes rather than regex literals.
The new invariant says that \X in SIMILAR TO should mean a literal X, but this code turns the pattern \d into the regex ^\d$. That matches a digit and does not match d.
I can reproduce this on the branch with:
SELECT '5' SIMILAR TO '\d', 'd' SIMILAR TO '\d';This returns true, false, but I would expect false, true.
Could we escape the next character as a regex literal instead, for example by applying regex escaping to that single character, rather than always prefixing it with \?
| | ScalarValue::Null, | ||
| m, | ||
| ) => Expr::Literal(ScalarValue::Utf8(None), m.clone()), | ||
| _ => { |
There was a problem hiding this comment.
This looks like it only fixes literal patterns, while non-literal SIMILAR TO patterns are now rejected during physical planning.
That seems like a SQL-visible regression, since SIMILAR TO should accept pattern expressions. For example, this should still be evaluable rather than failing during planning:
SELECT s FROM test WHERE s SIMILAR TO s;Could we move the translation to a layer that can handle runtime pattern values, such as a physical expression or scalar kernel that translates the pattern column before regex matching? Alternatively, please preserve the currently supported non-literal cases while still enforcing SQL SIMILAR TO semantics end to end.
Which issue does this PR close?
SIMILAR TOshould treat%as a wildcard #22263.Rationale for this change
SIMILAR TOis supposed to mix SQL LIKE wildcards (%,_) with POSIX regex metacharacters and match the entire string. DataFusion was loweringSIMILAR TOdirectly to aRegexMatchover the pattern verbatim, so:'abc' SIMILAR TO 'a%'returnedfalse(PG:true) because%was passed through as a literal regex character.'abc' SIMILAR TO 'b'returnedtrue(PG:false) because the regex match isn't anchored..,^,$were treated as regex metacharacters instead of literals.What changes are included in this PR?
translate_similar_to_pattern()helper indatafusion/physical-expr/src/expressions/binary.rsthat converts aSIMILAR TOpattern into an equivalent POSIX regex:^...$.%→.*,_→...,^,$(literal in SIMILAR TO, meta in regex).|,*,+,?,(),{m,n},[...]) through unchanged.\as an escape for the next character.]/^].datafusion/physical-expr/src/planner.rs) now translates literalUtf8/LargeUtf8/Utf8Viewpatterns at planning time. NULL patterns flow through as a typedUtf8null. Non-literal patterns return a clearnot_impl_errrather than the previous silently-wrong behavior.datafusion/sql/src/expr/mod.rs) pattern-type check widened to acceptLargeUtf8andUtf8Viewliterals (previously rejected even though the underlying regex match supports them).Are these changes tested?
Yes:
similar_to_pattern_translationinbinary.rscovers wildcards, anchoring, regex metas, literal./^/$, backslash escapes, bracket expressions (including[]abc]and[^]abc]).datafusion/sqllogictest/test_files/strings.slthas a new regression block exercising the bug-report case,_wildcard, anchoring, literal./^/$, regex metas (|,{m,n},+), backslash-escaped wildcards,NULLpattern, and the non-literal-pattern error.SIMILAR TO 'p[12].*'test instrings.sltwas relying on the buggy regex-passthrough behavior (p1e1etc. matched only because.was treated as regex.); it's been changed to'p[12]%'which expresses the same intent under correct SIMILAR TO semantics.Are there any user-facing changes?
Yes —
SIMILAR TOis now PostgreSQL-compatible:%,_, or unanchored matches. Existing patterns that relied on the previous regex-passthrough behavior may need to be updated (most obviously, change.*to%).Not yet implemented: SIMILAR TO with a non-literal pattern is not yet supportedinstead of silently returning a regex match. This was almost certainly broken in practice already, but it is a visible error message change.